ABSTRACT

Distance metric learning (DML) is an important task that has found
applications in many domains. The high computational cost of DML
arises from the large number of variables to be determined and the
constraint that a distance metric has to be a positive semi-definite
(PSD) matrix. Although stochastic gradient descent (SGD) has been
successfully applied to improve the efficiency of DML, it can still be
computationally expensive because, to ensure that the solution is a
PSD matrix, it has to project the updated distance metric onto the PSD
cone at every iteration, an expensive operation. We address this
challenge by developing two strategies within SGD, i.e., mini-batch
and adaptive sampling, to effectively reduce the number of updates
(i.e., projections onto the PSD cone) in SGD. We also develop hybrid
approaches that combine the strength of adaptive sampling with that of
mini-batch online learning techniques to further improve the
computational efficiency of SGD for DML. We prove theoretical
guarantees for both the adaptive sampling and the mini-batch based
approaches for DML. We also conduct an extensive empirical study to
verify the effectiveness of the proposed algorithms for DML.

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information Search and
Retrieval; I.2.6 [Artificial Intelligence]: Learning

General Terms 

Algorithms, Experimentation 

Keywords 

Distance Metric Learning, Stochastic Gradient Descent, Mini- 
Batch, Adaptive Sampling 

1. INTRODUCTION 
Distance metric learning (DML) is an important subject that has found
applications in many domains, including information retrieval [14],
supervised classification [19], clustering [20], and semi-supervised
clustering [5]. The objective of DML is to learn a distance metric
consistent with a given set of constraints, namely minimizing the
distances between pairs of data points from the same class and
maximizing the distances between pairs of data points from different
classes. The constraints are often specified in the form of
must-links, where data points belong to the same class, and
cannot-links, where data points belong to different classes. The
constraints can also be specified in the form of triplets
$(x_i, x_j, x_k)$ [19], in which $x_i$ and $x_j$ belong to a class
different from that of $x_k$, and therefore $x_i$ and $x_j$ should be
separated by a distance smaller than that between $x_i$ and $x_k$. In
this work, we focus on DML using triplet constraints due to its
encouraging performance [7, 18, 19].

The main computational challenge in DML arises from the restriction
that the learned distance metric must be a positive semi-definite
(PSD) matrix, which is often referred to as the PSD constraint. An
early approach [20] addressed the PSD constraint by exploring the
technique of semi-definite programming (SDP) [2], which unfortunately
does not scale to large and high dimensional datasets. More recent
approaches [3, 18] addressed this challenge by exploiting the
techniques of online learning and stochastic optimization,
particularly stochastic gradient descent (SGD), which only needs to
deal with one constraint at each iteration. Although these approaches
are significantly more efficient than the early approach, they share
one common drawback: to ensure that the learned distance metric is
PSD, they must project the updated distance metric onto the PSD cone
at each iteration. The projection step requires an eigen-decomposition
of the given matrix and is therefore computationally expensive^1. As a
result, the key challenge in developing efficient SGD algorithms for
DML is how to reduce the number of projections without affecting the
performance of DML.

^1 The computational cost is $O(d^2)$ if we only need to compute the
top eigenvectors of the distance metric and becomes $O(d^3)$ if all
the eigenvalues and eigenvectors have to be computed for the
projection step, where $d$ is the dimensionality of the data.

A common approach for reducing the number of updates and projections
in DML is to use a non-smooth loss function. A popular choice of
non-smooth loss function is the hinge loss, whose derivative becomes
zero when the input value exceeds a certain threshold. Many online
learning algorithms for DML [1, 9, 16] take advantage of the
non-smooth loss function to reduce the number of updates and
projections. In [18], the authors proposed a structure preserving
metric learning algorithm (SPML) that combines a mini-batch strategy
with the hinge loss to further reduce the number of updates for DML.
It groups multiple constraints into a mini-batch and performs only one
update of the distance metric for each mini-batch. However, according
to our empirical study, although SPML reduces the running time of the
standard SGD algorithm, it results in significantly worse performance
on several datasets, due to the deployment of the mini-batch strategy
together with the hinge loss.

In this work, we first develop a new mini-batch based SGD algorithm
for DML, termed Mini-SGD. Unlike SPML, which relies on the hinge loss,
the proposed Mini-SGD algorithm uses a smooth loss function for DML.
We show theoretically that by using a smooth loss function, Mini-SGD
is able to achieve a convergence rate similar to that of the standard
SGD algorithm but with significantly fewer updates. The second
contribution of this work is to develop a new strategy, termed
adaptive sampling, for reducing the number of projections in DML. The
key idea of adaptive sampling is to first measure the "difficulty" in
classifying a constraint using the learned distance metric, and then
perform stochastic updating based on the classification difficulty.
More specifically, given the distance metric $M_t$ and triplet
$(x_i^t, x_j^t, x_k^t)$, we first measure the difficulty in
classifying the triplet by $\gamma_t = \ell(x_i^t, x_j^t, x_k^t; M_t)$,
where $\ell(x_i^t, x_j^t, x_k^t; M_t)$ is the loss function that
measures the classification error. We then sample a binary variable
$Z_t$ with $\Pr(Z_t = 1) \propto \gamma_t$, and only update the
distance metric when $Z_t = 1$. We refer to the proposed approach for
DML as AS-SGD for short. Finally, we develop two hybrid approaches,
termed HA-SGD and HR-SGD, that combine adaptive sampling with
mini-batch to further improve the computational efficiency of SGD for
DML. We conduct an extensive empirical study to verify the
effectiveness and efficiency of the proposed algorithms for DML.

The rest of the paper is organized as follows: Section 2 reviews the
related work on distance metric learning and stochastic gradient
descent with a reduced number of projection steps. Section 3 describes
the proposed SGD algorithms for DML based on mini-batch and adaptive
sampling. Two hybrid approaches are presented that combine mini-batch
and adaptive sampling for DML. The theoretical guarantees for both
mini-batch based and adaptive sampling based SGD are also presented in
Section 3. Section 4 summarizes the results of the empirical study,
and Section 5 concludes this work with future directions.



2. RELATED WORK 

Many algorithms have been developed to learn a linear distance metric
from pairwise constraints, where must-links include pairs of data
points from the same class and cannot-links include pairs of data
points from different classes ([21] and references therein). Besides
pairwise constraints, an alternative strategy is to learn a distance
metric from a set of triplet constraints $(x_i^t, x_j^t, x_k^t)$,
$t = 1, \ldots, N$, where $x_i^t$ is expected to be closer to $x_j^t$
than to $x_k^t$. Previous studies [7, 18, 19] showed that triplet
constraints could be more effective for DML than pairwise constraints.

Several online algorithms have been developed to reduce the
computational cost of DML [3 IS] [H] [16]. Most of these methods are
based on stochastic gradient descent. At each iteration, they randomly
sample one constraint and update the distance metric based on the
sampled constraint. The updated distance metric is further projected
onto the PSD cone to ensure that it is PSD. Although these approaches
are significantly more scalable than the batch learning algorithms for
DML [1^, they suffer from the high computational cost of the
projection step that has to be performed at every iteration. A common
approach for reducing the number of projections is to use a non-smooth
loss function, such as the hinge loss. In addition, in [18], the
authors proposed a structure preserving metric learning (SPML)
algorithm that combines mini-batch with the hinge loss to further
reduce the number of projections. The main problem with the approach
proposed in [18] is that, according to the theory of mini-batch, it
only works well with a smooth loss. Since the hinge loss is a
non-smooth loss function, combining mini-batch with the hinge loss may
result in suboptimal performance. This is verified by our empirical
study, in which we observed that the distance metric learned by SPML
performs significantly worse than that learned by the standard
stochastic gradient descent method. We resolve this problem by
presenting a new SGD algorithm for DML that combines mini-batch with a
smooth loss instead of the hinge loss.

Finally, it is worthwhile mentioning several recent studies proposed
to avoid projections in SGD. In [13], the authors developed a
projection free SGD algorithm that replaces the projection step with a
constrained linear programming problem. In [17], the authors proposed
an SGD algorithm with only one projection that is performed at the end
of the iterations. Unfortunately, the improvement of the two
algorithms in computational efficiency is limited, because they
require computing, at each iteration, the minimum eigenvalue and
eigenvector of the updated distance metric, an operation with $O(d^2)$
cost, where $d$ is the dimensionality of the data.



3. IMPROVED SGD FOR DML BY MINI-BATCH AND ADAPTIVE SAMPLING

We first review the basic framework of DML with triplet constraints.
We then present two strategies to improve the computational efficiency
of SGD for DML, one by mini-batch and one by adaptive sampling. We
present the theoretical guarantees for both strategies, and defer more
detailed analysis to the appendix. At the end of this section, we
present two hybrid approaches that combine mini-batch with adaptive
sampling for more efficient DML.

3.1 DML with Triplet Constraints

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the domain for input
patterns, where $d$ is the dimensionality. For the convenience of
analysis, we assume all the input patterns have bounded norm, i.e.,
$\forall x \in \mathcal{X}, |x|_2 \le r$. Given a distance metric
$M \in \mathbb{R}^{d \times d}$, the squared distance between $x_a$
and $x_b$, denoted by $|x_a - x_b|_M^2$, is measured by

$$|x_a - x_b|_M^2 = (x_a - x_b)^\top M (x_a - x_b).$$



Let $\Omega = \{M : M \succeq 0, \|M\|_F \le R\}$ be the domain for
distance metric $M$, where $R$ specifies the domain size. Let
$\mathcal{D} = \{(x_i^1, x_j^1, x_k^1), \ldots, (x_i^N, x_j^N, x_k^N)\}$
be the set of triplet constraints used for DML, where $x_i^t$ is
expected to be closer to $x_j^t$ than to $x_k^t$. Let $\ell(z)$ be the
convex loss function. Define $\Delta(x_i^t, x_j^t, x_k^t; M)$ as

$$\begin{aligned}
\Delta(x_i^t, x_j^t, x_k^t; M) &= |x_i^t - x_k^t|_M^2 - |x_i^t - x_j^t|_M^2 \\
&= \langle M, (x_i^t - x_k^t)(x_i^t - x_k^t)^\top - (x_i^t - x_j^t)(x_i^t - x_j^t)^\top \rangle \\
&= \langle M, A_t \rangle
\end{aligned}$$

where

$$A_t = (x_i^t - x_k^t)(x_i^t - x_k^t)^\top - (x_i^t - x_j^t)(x_i^t - x_j^t)^\top.$$
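
As a small illustration, the quantities defined above can be computed
with NumPy. This is only a sketch under stated assumptions: the
function names `squared_distance`, `triplet_matrix`, and `margin`, as
well as the random vectors used to exercise them, are introduced here
for illustration and do not come from the original description.

```python
import numpy as np

def squared_distance(M, xa, xb):
    """Squared distance |xa - xb|_M^2 = (xa - xb)^T M (xa - xb)."""
    diff = xa - xb
    return float(diff @ M @ diff)

def triplet_matrix(xi, xj, xk):
    """A_t = (xi - xk)(xi - xk)^T - (xi - xj)(xi - xj)^T."""
    dik = xi - xk
    dij = xi - xj
    return np.outer(dik, dik) - np.outer(dij, dij)

def margin(M, xi, xj, xk):
    """Delta = <M, A_t>, the Frobenius inner product of M and A_t."""
    return float(np.sum(M * triplet_matrix(xi, xj, xk)))

# Hypothetical data: three random points and an arbitrary PSD matrix.
rng = np.random.default_rng(0)
xi, xj, xk = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 4))
M = B @ B.T
# The inner-product form agrees with the difference of squared distances.
direct = squared_distance(M, xi, xk) - squared_distance(M, xi, xj)
```

A positive value of `margin` means that $x_i$ is closer to $x_j$ than
to $x_k$ under $M$, which is the ordering the triplet asks for.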

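Given the triplet constraints in $\mathcal{D}$ and the domain
$\Omega$, we learn an optimal distance metric
$M \in \mathbb{R}^{d \times d}$ by solving the following optimization
problem

$$\min_{M \in \Omega} \; \mathcal{L}(M) = \frac{1}{N} \sum_{t=1}^{N} \ell\big(\Delta(x_i^t, x_j^t, x_k^t; M)\big) \tag{1}$$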

The key idea of online DML is to update the distance metric based on
one sampled constraint at each iteration. More specifically, at
iteration $t$, it samples a triplet constraint
$(x_i^t, x_j^t, x_k^t)$, and updates the distance metric $M_t$ to
$M_{t+1}$ by

$$M_{t+1} = \Pi_\Omega\big(M_t - \eta\, \ell'(\Delta(x_i^t, x_j^t, x_k^t; M_t))\, A_t\big)$$

where $\eta > 0$ is the step size, $\ell'(\cdot)$ is the derivative of
the loss, and $\Pi_\Omega(M)$ projects a matrix $M$ onto the domain
$\Omega$. The following proposition shows that $\Pi_\Omega(M)$ can be
computed in two steps, i.e., first projecting $M$ onto the PSD cone,
and then scaling the projected $M$ to fit the constraint
$\|M\|_F \le R$.



Proposition 1. We have

$$\Pi_\Omega(M) = \frac{1}{\max(\|M'\|_F / R,\; 1)}\, M'$$

where $M' = P(M)$ and $P(M)$ projects matrix $M$ onto the PSD cone.

As indicated by Proposition 1, $\Pi_\Omega(M)$ requires projecting
distance metric $M$ onto the PSD cone, an expensive operation that
requires an eigen-decomposition of $M$.
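
The two-step projection of Proposition 1 can be sketched as follows.
The sketch assumes that $M$ is symmetric (it symmetrizes the input to
guard against round-off) and realizes the PSD projection $P(M)$ by
clipping negative eigenvalues to zero, which is the standard
Frobenius-norm projection onto the PSD cone; the function names
`project_psd` and `project_domain` are hypothetical.

```python
import numpy as np

def project_psd(M):
    """P(M): project a symmetric matrix onto the PSD cone.

    Negative eigenvalues are set to zero; this needs a full
    eigen-decomposition, which is the costly part of each update.
    """
    S = (M + M.T) / 2.0                    # symmetrize against round-off
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T

def project_domain(M, R):
    """Pi_Omega(M) = M' / max(||M'||_F / R, 1) with M' = P(M)."""
    Mp = project_psd(M)
    return Mp / max(np.linalg.norm(Mp, "fro") / R, 1.0)
```

For example, with `M = np.diag([3.0, -1.0])` and `R = 2.0`, the PSD
step removes the negative eigenvalue and the scaling step shrinks the
result to Frobenius norm 2.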

Finally, to bound both the regret and the number of updates, in this
study, we approximate the hinge loss by a smooth loss function

$$\ell(z) = \frac{1}{L} \log\big(1 + \exp(-L(z - 1))\big) \tag{2}$$

where $L > 0$ is a parameter that controls the approximation error:
the larger the $L$, the closer $\ell(z)$ is to the hinge loss. Note
that the smooth approximation of the hinge loss was first suggested in
[23] for classification and was later verified by an empirical study
in [^ . The key properties of the loss function $\ell(z)$ in (2) are
given in the following proposition.

Proposition 2. For the loss function defined in (2), we have

$$\forall z \in \mathbb{R}, \quad |\ell'(z)| \le 1, \quad |\ell'(z)| \le L\,\ell(z).$$

Compared to the hinge loss function, the main advantage of the loss
function in (2) is that it is a smooth loss function. As will be
revealed by our analysis, it is the smoothness of the loss function
that allows us to effectively explore both the mini-batch and adaptive
sampling strategies for more efficient DML without having to sacrifice
the prediction performance.
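
The loss in (2) and its derivative are simple to evaluate numerically.
The sketch below is an illustration only: the function names are
hypothetical, the default `L=3.0` mirrors the value reported later in
the experiments, and the stable forms (`logaddexp`, `tanh`) are an
implementation choice made here to avoid overflow.

```python
import numpy as np

def smooth_loss(z, L=3.0):
    """l(z) = (1/L) * log(1 + exp(-L (z - 1)))."""
    return np.logaddexp(0.0, -L * (z - 1.0)) / L

def smooth_loss_grad(z, L=3.0):
    """l'(z) = -1 / (1 + exp(L (z - 1))), written via tanh for stability."""
    return -0.5 * (1.0 - np.tanh(0.5 * L * (z - 1.0)))

def hinge_loss(z):
    """Hinge loss max(0, 1 - z), the function that l(z) approximates."""
    return np.maximum(0.0, 1.0 - z)
```

Evaluating both properties of Proposition 2 on a grid of `z` values is
a quick sanity check, and increasing `L` moves `smooth_loss` toward
`hinge_loss`.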



Algorithm 1 Mini-batch Stochastic Gradient Descent (Mini-SGD) for DML

1: Input: triplet constraints $\{(x_i^t, x_j^t, x_k^t)\}_{t=1}^{N}$,
   step size $\eta$, mini-batch size $b$, and domain size $R$
2: Initialize $M_1 = I$ and $T = N/b$
3: for $t = 1, \ldots, T$ do
4:    Sample $b$ triplet constraints
      $\{(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})\}_{s=1}^{b}$
5:    Update the distance metric by
      $M_{t+1} = \Pi_\Omega(M_t - \eta \nabla \ell_t(M_t))$
6: end for
7: return $\bar{M} = \frac{1}{T} \sum_{t=1}^{T} M_t$



3.2 Mini-batch SGD for DML (Mini-SGD) 

Mini-batch SGD improves the computational efficiency of 
online DML by grouping multiple constraints into a mini- 
batch and only updating the distance metric once for each 
mini-batch. For brevity, we will refer to this algorithm as 
Mini-SGD in the rest of the paper. 

Let $b$ be the batch size. At iteration $t$, it samples $b$ triplet
constraints, denoted by

$$(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}), \quad s = 1, \ldots, b,$$

and defines the mini-batch loss at iteration $t$ as

$$\ell_t(M_t) = \frac{1}{b} \sum_{s=1}^{b} \ell\big(\Delta(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}; M_t)\big).$$

Mini-batch DML updates the distance metric $M_t$ to $M_{t+1}$ using
the gradient of the mini-batch loss function $\ell_t(M)$, i.e.,

$$M_{t+1} = \Pi_\Omega\big(M_t - \eta \nabla \ell_t(M_t)\big).$$

Algorithm 1 gives the detailed steps of Mini-SGD for DML, where step 5
uses Proposition 1 for computing the projection $\Pi_\Omega(\cdot)$.
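
A compact NumPy sketch of Algorithm 1 is given below. It is not the
authors' implementation: the array layout `(N, 3, d)` for the
triplets, sampling the mini-batch with replacement, the helper names,
and the default parameter values are all assumptions made for this
illustration.

```python
import numpy as np

def loss_grad(z, L=3.0):
    """Derivative of the smooth loss (2): -1 / (1 + exp(L (z - 1)))."""
    return -0.5 * (1.0 - np.tanh(0.5 * L * (z - 1.0)))

def project_domain(M, R):
    """Projection of Proposition 1: clip eigenvalues, then rescale."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    Mp = (V * np.clip(w, 0.0, None)) @ V.T
    return Mp / max(np.linalg.norm(Mp, "fro") / R, 1.0)

def mini_sgd(triplets, eta=1.0, b=10, R=1000.0, L=3.0, seed=0):
    """triplets: array of shape (N, 3, d) holding (x_i, x_j, x_k)."""
    rng = np.random.default_rng(seed)
    N, _, d = triplets.shape
    T = N // b
    M = np.eye(d)                                  # M_1 = I
    M_sum = np.zeros((d, d))
    for _ in range(T):
        M_sum += M                                 # accumulate M_1..M_T
        batch = triplets[rng.integers(0, N, size=b)]
        dik = batch[:, 0] - batch[:, 2]
        dij = batch[:, 0] - batch[:, 1]
        # Delta_s = |x_i - x_k|_M^2 - |x_i - x_j|_M^2 for each triplet
        delta = (np.einsum("sa,ab,sb->s", dik, M, dik)
                 - np.einsum("sa,ab,sb->s", dij, M, dij))
        g = loss_grad(delta, L)
        # gradient of the mini-batch loss: (1/b) sum_s l'(Delta_s) A_s
        grad = ((dik * g[:, None]).T @ dik
                - (dij * g[:, None]).T @ dij) / b
        M = project_domain(M - eta * grad, R)      # one projection per batch
    return M_sum / T                               # averaged metric
```

The point of the sketch is the last line of the loop: the expensive
projection runs once per mini-batch, i.e., `N // b` times rather than
`N` times.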

The theorem below provides the theoretical guarantee for the Mini-SGD
algorithm for DML using the smooth loss function defined in (2).

Theorem 1. Let $\bar{M}$ be the solution output by Algorithm 1 that
uses the loss function defined in (2). Let $M_*$ be the optimal
solution to (1). Assume $\|A_t\|_F \le A$ for any triplet constraint.
For a fixed $\delta \in (0, 1)$, we have, with a probability
$1 - 2\delta$:

$$\mathcal{L}(\bar{M}) \le \frac{\mathcal{L}(M_*)}{1 - 3\eta L A^2} + \frac{b R^2}{2(1 - 3\eta L A^2)\eta N} + \frac{C_1 A^2 \eta}{(1 - 3\eta L A^2) N} \log\frac{2m}{\delta} \tag{3}$$

where $m = \lceil \log_2 N \rceil$, and $C_1$ is a universal constant
that is at most 32.

Figure 1 shows the reduction in the training error over the number of
triplet constraints by the Mini-SGD algorithm on three datasets^2.
Compared to the standard SGD algorithm, we observe that Mini-SGD
converges to a similar value of training error, thus validating our
theorem empirically.

^2 The information of these datasets can be found in the experimental
section.



Algorithm 2 Adaptive Sampling Stochastic Gradient Descent (AS-SGD)
for DML

1: Input: triplet constraints $\{(x_i^t, x_j^t, x_k^t)\}_{t=1}^{N}$,
   step size $\eta$, and domain size $R$
2: Initialize $M_1 = I$
3: for $t = 1, \ldots, N$ do
4:    Sample a binary random variable $Z_t$ with
      $\Pr(Z_t = 1) = |\ell'(\Delta(x_i^t, x_j^t, x_k^t; M_t))|$
5:    if $Z_t = 1$ then
6:       Update the distance metric by
         $\tau_t = \mathrm{sign}(\ell'(\Delta(x_i^t, x_j^t, x_k^t; M_t)))$
         $M_{t+1} = \Pi_\Omega(M_t - \eta \tau_t A_t)$
7:    end if
8: end for
9: return $\bar{M} = \frac{1}{N} \sum_{t=1}^{N} M_t$



Remark 1 We observe that the second term in the upper bound in (3),
i.e., $bR^2 / [2(1 - 3\eta L A^2)\eta N]$, has a linear dependence on
the mini-batch size $b$, implying that the larger the $b$, the less
accurate the distance metric learned by Algorithm 1. Hence, by
adjusting parameter $b$, the size of the mini-batch, we are able to
make an appropriate tradeoff between the prediction accuracy and the
computational efficiency: the smaller the $b$, the more accurate the
distance metric, but with more updates and consequently higher
computational cost. Finally, it is worthwhile comparing Theorem 1 to
the theoretical result for a general mini-batch SGD algorithm given in
[8], i.e.



$$\mathcal{L}(\bar{M}) \le \mathcal{L}(M_*) + O\big(1/\sqrt{N}\big) \tag{4}$$



It is clear that Theorem 1 gives a significantly better result when
the optimal loss $\mathcal{L}(M_*)$ is small (i.e., when the triplet
constraints can be well classified by the optimal distance metric
$M_*$). In particular, when $\mathcal{L}(M_*) = O(b/N)$, the
convergence rate given in Theorem 1 is on the order of $O(b/N)$, while
the convergence rate in (4) is only $O(1/\sqrt{N})$.
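
To give a rough sense of the gap between the two rates, take the
values used later in the experiments ($b = 10$, $N = 100{,}000$) and
ignore all constants. This is an illustrative calculation only, not a
statement from the analysis:

$$\frac{b}{N} = \frac{10}{100{,}000} = 10^{-4}, \qquad \frac{1}{\sqrt{N}} = \frac{1}{\sqrt{100{,}000}} \approx 3.2 \times 10^{-3}.$$

Under the condition $\mathcal{L}(M_*) = O(b/N)$, the first rate is
therefore smaller by a factor of about 30 for these values.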

3.3 Adaptive Sampling based SGD for DML 
(AS-SGD) 

We now develop a new approach for reducing the number of updates in
SGD in order to improve the computational efficiency of DML. Instead
of updating the distance metric at each iteration, the proposed
strategy introduces a random binary variable to decide if the distance
metric $M_t$ will be updated given a triplet constraint
$(x_i^t, x_j^t, x_k^t)$. More specifically, it computes the derivative
$\ell'(\Delta(x_i^t, x_j^t, x_k^t; M_t))$, and samples a random
variable $Z_t$ with probability

$$\Pr(Z_t = 1) = |\ell'(\Delta(x_i^t, x_j^t, x_k^t; M_t))|.$$

The distance metric will be updated only when $Z_t = 1$. According to
Proposition 2, we have
$|\ell'(\Delta(x_i^t, x_j^t, x_k^t; M_t))| \le L\,\ell(\Delta(x_i^t, x_j^t, x_k^t; M_t))$
for the smooth loss function given in (2), implying that a triplet
constraint has a high chance to be used for updating the distance
metric if it has a large loss. Therefore, the essential idea of the
proposed adaptive sampling strategy is to give a large chance to
update the distance metric when the triplet is difficult to classify
and a low chance when the triplet can be classified correctly with a
large margin. We note that an alternative strategy is to sample a
triplet constraint $(x_i^t, x_j^t, x_k^t)$ based on its loss
$\ell(\Delta(x_i^t, x_j^t, x_k^t; M_t))$. We did not choose the loss
as the basis for updating because it is the derivative, not the loss,
that will be used by SGD for updating the distance metric. The
detailed steps of adaptive sampling based SGD for DML are given in
Algorithm 2. We refer to this algorithm as AS-SGD for short in the
rest of this paper.
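
The adaptive sampling step can be sketched in NumPy as follows. As
with any such sketch, the details are assumptions made for
illustration: triplets are stored in an array of shape `(N, 3, d)` and
visited in order, the helper names are hypothetical, and the function
additionally returns the number of updates so that the saving in
projections can be inspected.

```python
import numpy as np

def loss_grad(z, L=3.0):
    """Derivative of the smooth loss (2): -1 / (1 + exp(L (z - 1)))."""
    return -0.5 * (1.0 - np.tanh(0.5 * L * (z - 1.0)))

def project_domain(M, R):
    """Projection of Proposition 1: clip eigenvalues, then rescale."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    Mp = (V * np.clip(w, 0.0, None)) @ V.T
    return Mp / max(np.linalg.norm(Mp, "fro") / R, 1.0)

def as_sgd(triplets, eta=1.0, R=1000.0, L=3.0, seed=0):
    """Return the averaged metric and the number of updates performed."""
    rng = np.random.default_rng(seed)
    N, _, d = triplets.shape
    M = np.eye(d)                                  # M_1 = I
    M_sum = np.zeros((d, d))
    updates = 0
    for xi, xj, xk in triplets:
        M_sum += M
        dik, dij = xi - xk, xi - xj
        delta = dik @ M @ dik - dij @ M @ dij      # Delta = <M, A_t>
        g = loss_grad(delta, L)
        if rng.random() < abs(g):                  # Z_t = 1 w.p. |l'(Delta)|
            A = np.outer(dik, dik) - np.outer(dij, dij)
            M = project_domain(M - eta * np.sign(g) * A, R)
            updates += 1
    return M_sum / N, updates
```

Because the update uses only the sign of the derivative while the
update probability equals its magnitude, the expected step matches the
plain SGD step $\eta\, \ell'(\Delta) A_t$. Triplets that are already
separated by a large margin have a derivative close to zero and are
almost always skipped.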

The theorem below provides the performance guarantee for AS-SGD. It
also bounds the number of updates $\sum_{t=1}^{N} Z_t$ for AS-SGD.

Theorem 2. Let $\bar{M}$ be the solution output by Algorithm 2 that
uses the loss function defined in (2). Let $M_*$ be the optimal
solution to (1). Assume $\|A_t\|_F \le A$ for any triplet constraint.
For a fixed $\delta \in (0, 1)$, we have, with a probability
$1 - 2\delta$:



£{M) < 
and 

where 



Ca 



1 - 3r]LA^ (1 - 3r]LA^)N 



— +n- 



N N 

J2Zt<lLj:iiMt) + lln^ 



il .,^, "15,2, m„,, 2m 1 

max-^ - + 16\n —, -A^ \n —, RAln ^ } 

'2 5 4 S S j 



(5) 



(6) 



$m = \lceil \log_2 N \rceil$.

Remark 2 The bound given in (5) shares a similar structure with that
given in (3), except that it does not have the mini-batch size $b$
that can be used to make a tradeoff between the number of updates and
the classification accuracy. The number of updates performed by
Algorithm 2 is bounded by (6). The dominant term in (6) is
$O(\sum_{t=1}^{N} \ell(M_t))$, implying that Algorithm 2 will have a
small number of updates if the learned distance metric $M_t$ can
classify the triplet constraint correctly at most iterations. In other
words, the smaller the number of classification mistakes made by the
learned distance metric $M_t$, the fewer updates will be performed by
Algorithm 2. We validate the theorem by running the AS-SGD algorithm
on three datasets. Figure 1 shows the reduction in the training error
over the number of triplet constraints by AS-SGD and the standard SGD
algorithm. We observe that AS-SGD converges to a similar value of
training error as the full SGD algorithm.

3.4 Hybrid Approaches: Combine Mini-batch 
with Adaptive Sampling for DML 

Since mini-batch and adaptive sampling improve the com- 
putational efficiency of SGD from different aspects, it is nat- 
ural to combine them together for more efficient DML. Sim- 
ilar to the Mini-SGD algorithm, the hybrid approaches will 
group multiple triplet constraints into a mini-batch. But, 
unlike Mini-SGD that updates the distance metric for every 
mini-batch of constraints, the hybrid approaches follow the 
idea of adaptive sampling, and introduce a binary random 
variable to decide if the distance metric will be updated for 
every mini-batch of constraints. By combining the strength
of mini-batch and adaptive sampling for SGD, the hybrid
approaches are able to make further improvement in the
computational efficiency of DML. Algorithm 3 highlights the
key steps of the hybrid approaches.

Figure 1: The convergence of different SGD algorithms



Algorithm 3 A Framework of Hybrid Stochastic Gradient Descent
(Hybrid-SGD) for DML

 1: Input: triplet constraints $\{(x_i^t, x_j^t, x_k^t)\}_{t=1}^{N}$,
    step size $\eta$, mini-batch size $b$, and domain size $R$
 2: Initialize $M_1 = I$ and $T = N/b$
 3: for $t = 1, \ldots, T$ do
 4:    Sample $b$ triplets
       $\{(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})\}_{s=1}^{b}$
 5:    Compute sampling probability $\gamma_t$
 6:    Sample a binary random variable $Z_t$ with
       $\Pr(Z_t = 1) = \gamma_t$
 7:    if $Z_t = 1$ then
 8:       Update the distance metric by
          $\tau_t = 1/\gamma_t$
          $M_{t+1} = \Pi_\Omega(M_t - \eta \tau_t \nabla \ell_t(M_t))$
 9:    end if
10: end for
11: return $\bar{M} = \frac{1}{T} \sum_{t=1}^{T} M_t$



Table 1: Statistics for the ten datasets used in our empirical study.

            # class   # feature   # train     # test
semeion     10        256         1,115       478
dna         3         180         2,000       1,186
isolet      26        617         6,238       1,559
tdt30       30        200         6,575       2,819
letter      26        16          15,000      5,000
protein     3         357         17,766      6,621
connect4    3         42          47,289      20,268
sensit      3         100         78,823      19,705
rcv20       20        200         477,141     14,185
poker       10        10          1,000,000   25,010



One of the key steps in the hybrid approaches (step 5 in Algorithm 3)
is to choose an appropriate sampling probability $\gamma_t$ for every
mini-batch of constraints $(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})$,
$s = 1, \ldots, b$. In this work, we study two different choices for
the sampling probability $\gamma_t$:

• The first approach chooses $\gamma_t$ based on a triplet constraint
  randomly sampled from a mini-batch. More specifically, given a
  mini-batch of triplet constraints
  $\{(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})\}_{s=1}^{b}$, it randomly
  samples an index $s'$ in the range $[1, b]$. It then sets the
  sampling probability $\gamma_t$ to be the magnitude of the
  derivative for the randomly sampled triplet, i.e.,

  $$\gamma_t = |\ell'(\Delta(x_i^{t,s'}, x_j^{t,s'}, x_k^{t,s'}; M_t))|.$$

  We refer to this approach as HR-SGD.

• The second approach is based on the average case analysis. It sets
  the sampling probability as the average derivative measured by the
  norm of the gradient $\nabla \ell_t(M_t)$, i.e.,

  $$\gamma_t = \frac{1}{W} \|\nabla \ell_t(M_t)\|_F,$$

  where $W = \max_t \|\nabla \ell_t(M_t)\|_F$ and is estimated by
  sampling. We refer to this approach as HA-SGD.
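
The two choices of sampling probability for HR-SGD and HA-SGD, and the
update in step 8 of Algorithm 3, can be sketched as below. The helper
names are hypothetical. Clamping the HA-SGD probability at 1 is an
assumption added here, since $W$ is only an estimate of the maximum
gradient norm and may be exceeded; the source does not say how this
case is handled.

```python
import numpy as np

def loss_grad(z, L=3.0):
    """Derivative of the smooth loss (2): -1 / (1 + exp(L (z - 1)))."""
    return -0.5 * (1.0 - np.tanh(0.5 * L * (z - 1.0)))

def project_domain(M, R):
    """Projection of Proposition 1: clip eigenvalues, then rescale."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    Mp = (V * np.clip(w, 0.0, None)) @ V.T
    return Mp / max(np.linalg.norm(Mp, "fro") / R, 1.0)

def hr_probability(deltas, rng, L=3.0):
    """HR-SGD: |l'(Delta)| of one triplet drawn uniformly from the batch."""
    s = rng.integers(0, len(deltas))
    return float(abs(loss_grad(deltas[s], L)))

def ha_probability(grad, W):
    """HA-SGD: ||grad l_t(M_t)||_F / W, clamped to 1 (assumption)."""
    return float(min(np.linalg.norm(grad, "fro") / W, 1.0))

def hybrid_update(M, grad, gamma, eta, R, rng):
    """Steps 6-8 of Algorithm 3: update with probability gamma."""
    if rng.random() < gamma:
        # tau_t = 1 / gamma_t rescales the step taken on sampled batches
        return project_domain(M - (eta / gamma) * grad, R), True
    return M, False
```

Here `deltas` holds the values $\Delta(\cdot; M_t)$ of the triplets in
the current mini-batch and `grad` is the mini-batch gradient
$\nabla \ell_t(M_t)$. HR-SGD needs only one derivative per mini-batch
to decide whether to update, whereas HA-SGD needs the full mini-batch
gradient first.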

4. EXPERIMENTS 

Ten datasets are used to validate the effectiveness of the proposed
algorithms. Table 1 summarizes the information of these datasets.
Datasets dna, letter [15], protein and sensit [10] are downloaded from
LIBSVM [5]. Datasets tdt30 and rcv20 are document corpora: tdt30 is
the subset of the tdt2 data [3] comprising the documents from the 30
most popular categories, and rcv20 is the subset of the large rcv1
dataset [1] consisting of documents from the 20 most popular
categories. We reduce the dimensionality of these document datasets to
200 by principal component analysis (PCA). All the other datasets are
downloaded directly from the UCI repository [11]. For most datasets
used in this study, we use the standard training/testing split
provided by the original dataset, except for datasets semeion,
connect4 and tdt30. For these three datasets, we randomly select 70%
of the data for training and use the remaining 30% for testing;
experiments related to these three datasets are repeated ten times,
and the prediction result averaged over ten trials is reported. All
experiments are run on a laptop with 8GB memory and two 2.50GHz Intel
Core i5-2520M CPUs.

4.1 Parameter Setting 

The parameter $L$ in the loss function (2) is set to be 3 according to
the suggestion in [23]. We set $N$ = 100,000 for the number of
iterations (i.e., the number of triplet constraints). To construct a
triplet constraint at each iteration $t$, we first randomly sample an
example $(x_i^t, y_i^t)$ from the training data; we then find two of
its nearest neighbors $x_j^t$ and $x_k^t$, measured by Euclidean
distance, from the training examples, with $x_j^t$ sharing the same
class label as $x_i^t$ and $x_k^t$ belonging to a class different from
$y_i^t$. For Mini-SGD and the hybrid approaches, we set $b$ = 10 for
the size of mini-batch as in [18], leading to a total of $T$ = 10,000
iterations for these approaches. We evaluate the learned distance
metric by the classification error of a k-NN classifier on the test
data, where the number of nearest neighbors $k$ is set to be 3 based
on our experience.
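
The triplet construction described above can be sketched as follows.
The function name `sample_triplet` is hypothetical, and the sketch
assumes that every class has at least two training examples and that
at least two classes are present; it uses a brute-force search over
all training points, which may differ from how the experiments were
actually implemented.

```python
import numpy as np

def sample_triplet(X, y, rng):
    """Return (x_i, x_j, x_k) for one randomly chosen anchor x_i.

    x_j is the nearest training example with the same label as x_i,
    x_k is the nearest training example with a different label,
    both measured by Euclidean distance.
    """
    i = rng.integers(0, len(X))
    dist = np.sum((X - X[i]) ** 2, axis=1)
    same = (y == y[i])
    same[i] = False                        # exclude the anchor itself
    diff = (y != y[i])
    j = np.flatnonzero(same)[np.argmin(dist[same])]
    k = np.flatnonzero(diff)[np.argmin(dist[diff])]
    return X[i], X[j], X[k]
```

Calling `sample_triplet` $N$ times yields the stream of constraints
consumed by the SGD variants.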

Parameter $R$ in the proposed algorithms determines the domain size
for the distance metric to be learned. We observe that the
classification error of k-NN remains almost unchanged when varying $R$
in the range of {100, 1000, 10000}. We thus set $R$ = 1,000 for all
the experiments. Another important parameter used by the proposed
algorithms is the step size $\eta$. We evaluate the impact of step
size $\eta$ by measuring the classification error of a k-NN algorithm
that uses the distance metric learned by the Mini-SGD algorithm with
$\eta \in \{0.1, 1, 10\}$. We observe that $\eta$ = 1 yields a low
classification error for almost all datasets by cross-validation with
$R$ = 1,000 and $T$ = 10. We thus fix $\eta$ = 1 for the proposed
algorithms in all the experiments.

4.2 Experiment (I): Effectiveness of the Pro- 
posed SGD Algorithms for DML 

In this experiment, we compare the performance of the proposed SGD
algorithms for DML, i.e., Mini-SGD, AS-SGD and two hybrid approaches
(HR-SGD and HA-SGD), to the full version of SGD for DML (SGD). We also
include Euclidean distance as the reference method in our comparison.
Table 2 shows the classification error of k-NN (k = 3) using the
distance metric learned by different DML algorithms. First, it is not
surprising to observe that all the distance metric learning algorithms
improve the classification performance of k-NN compared to the
Euclidean distance. Second, for almost all datasets, we observe that
all the proposed DML algorithms (i.e., Mini-SGD, AS-SGD, HR-SGD, and
HA-SGD) yield similar classification performance as SGD, the full
version of the SGD algorithm for DML. This result confirms that the
proposed SGD algorithms are effective for DML despite the
modifications we made to the SGD algorithm.
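
The evaluation protocol, k-NN classification under a learned metric,
can be sketched as below. This is an illustrative brute-force version,
not the code used for the reported results: the function name
`knn_error` is hypothetical, labels are assumed to be stored in NumPy
arrays, ties in the vote are broken in favor of the smallest label,
and the pairwise-distance tensor makes it suitable only for small
data.

```python
import numpy as np

def knn_error(M, X_train, y_train, X_test, y_test, k=3):
    """Fraction of test points misclassified by k-NN under metric M."""
    # Factor M = G^T G so that |a - b|_M^2 = ||G a - G b||^2.
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    G = (V * np.sqrt(np.clip(w, 0.0, None))).T
    Z_train, Z_test = X_train @ G.T, X_test @ G.T
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]        # k closest per test point
    errors = 0
    for neighbor_labels, truth in zip(y_train[nearest], y_test):
        labels, counts = np.unique(neighbor_labels, return_counts=True)
        errors += int(labels[np.argmax(counts)] != truth)   # majority vote
    return errors / len(y_test)
```

With `M = np.eye(d)` the function reduces to ordinary Euclidean k-NN,
which corresponds to the Euclidean reference method in the comparison.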

4.3 Experiment (II) : Efficiency of the Proposed 
SGD Algorithms for DML 

Table 3 summarizes the running time for the proposed DML algorithms
and the SGD method. We note that the running time in Table 3 does not
take into account the time for constructing triplet constraints since
it is shared by all the methods in comparison.

It is not surprising to observe that all the proposed SGD algorithms,
including Mini-SGD, AS-SGD, HA-SGD and HR-SGD, significantly reduce
the running time of SGD. For instance, for dataset isolet, it takes
SGD more than 32,000 seconds to learn a distance metric, while the
running time is reduced to less than 3,500 seconds when applying the
proposed SGD algorithms, roughly a factor of 10 reduction in running
time. Comparing the running time of AS-SGD to that of Mini-SGD, we
observe that each method has its own advantage: AS-SGD is more
efficient on datasets semeion, dna, isolet, and tdt30, while Mini-SGD
is more efficient on the other six datasets. This is because different
mechanisms are employed by AS-SGD and Mini-SGD to reduce the
computational cost: AS-SGD improves the computational efficiency of
DML by skipping the constraints that are easy to classify, while
Mini-SGD improves the computational efficiency of SGD by updating the
distance metric once for multiple triplet constraints. Finally, we
observe that the two hybrid approaches, which combine the strength of
both adaptive sampling and mini-batch SGD, are computationally most
efficient for almost all datasets. We also observe that HR-SGD appears
to be more efficient than HA-SGD on six datasets and only loses on
datasets protein, sensit and rcv20. This is because HR-SGD computes
the sampling probability $\gamma_t$ based on one randomly sampled
triplet, while HA-SGD needs to compute the average derivative for each
mini-batch of triplet constraints for the sampling probability.

To further examine the computational efficiency of the proposed SGD
algorithms for DML, we summarize in Table 4 the number of updates
performed by different SGD algorithms. We observe that all the
proposed SGD algorithms for DML are able to reduce the number of
updates significantly compared to SGD. Comparing Mini-SGD to AS-SGD,
we observe that for some datasets (e.g., semeion, dna, isolet, and
tdt30), the number of updates performed by AS-SGD is significantly
smaller than that of Mini-SGD, while it is the other way around for
the other datasets. This is again due to the fact that AS-SGD and
Mini-SGD deploy different mechanisms for reducing computational costs.
As we expect, the two hybrid approaches are able to further reduce the
number of updates performed by AS-SGD and Mini-SGD, making them more
efficient algorithms for DML.

Comparing the results in Table 3 with those in Table 4, we
observe that a small number of updates does NOT always guarantee
a short running time. This is exhibited by the comparison between
the two hybrid approaches: although HA-SGD performs a similar
number of updates to HR-SGD on the dna and isolet datasets,
HA-SGD takes significantly longer than HR-SGD to finish the
computation. It is also exhibited by comparing the results across
different datasets for a fixed method. For example, for the HA-SGD
method, the number of updates for the protein dataset is nearly
the same as that for the poker dataset, but the running time for
the protein dataset is roughly 40 times longer than that for the
poker dataset (145.6 versus 3.4 seconds in Table 3). This result
may seem counterintuitive at first glance. A more careful analysis,
however, reveals that in addition to the number of updates, the
running time of DML is also affected by the computational cost per
iteration, which reconciles the results in Table 3 and Table 4.
When comparing the two hybrid approaches, we observe that HA-SGD
incurs a higher computational cost per iteration than HR-SGD
because HA-SGD has to compute the norm of the average gradient
over each mini-batch, while HR-SGD only needs to compute the
derivative of one randomly sampled triplet constraint for each
mini-batch. When comparing the running time across different
datasets, the protein dataset has a significantly higher
dimensionality than the poker dataset, and therefore incurs a
higher computational cost per iteration, because the cost of
projecting an updated distance metric onto the PSD cone
increases at least quadratically in the dimensionality.
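
As a rough illustration of why the per-iteration cost grows with the dimensionality, the projection onto the PSD cone can be sketched with a full eigendecomposition. This is a minimal sketch in Python with NumPy, not the Matlab implementation used in the experiments; the function name `project_psd` is hypothetical. For a dense d x d matrix, the eigendecomposition takes on the order of d^3 operations, which is consistent with the at-least-quadratic growth noted above.

```python
import numpy as np


def project_psd(M):
    """Return the PSD matrix nearest to M in Frobenius norm (illustrative helper)."""
    S = (M + M.T) / 2.0                      # symmetrize to guard against round-off
    eigvals, eigvecs = np.linalg.eigh(S)     # full eigendecomposition of a symmetric matrix
    eigvals = np.clip(eigvals, 0.0, None)    # zero out the negative eigenvalues
    return (eigvecs * eigvals) @ eigvecs.T   # V diag(lambda_+) V^T


# Toy example: one negative eigenvalue is removed by the projection.
M = np.array([[2.0, 0.0],
              [0.0, -1.0]])
print(project_psd(M))
```

Because this step is executed once per update, reducing the number of updates and working with lower-dimensional data both shorten the running time.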



Table 2: Classification error (%) of k-NN (k = 3) using the distance metrics learned by different SGD methods,
online learning algorithms and batch learning approach for DML.

          | Baseline  | Batch | Online Learning       | Proposed Methods
Dataset   | Euclidean | LMNN  | LEGO   OASIS   SPML   | SGD    Mini-SGD  AS-SGD  HR-SGD  HA-SGD
----------+-----------+-------+-----------------------+----------------------------------------
semeion   | 8.7       | 9.0   | 11.9   8.3     6.3    | 6.3    6.5       6.3     6.4     6.2
dna       | 20.7      | 6.2   | 9.3    16.6    9.1    | 8.6    9.4       8.4     8.1     8.1
isolet    | 9.0       | 5.4   | 8.3    6.5     6.6    | 6.3    6.2       6.0     6.4     6.1
tdt30     | 5.3       | 3.0   | 14.6   4.0     3.7    | 3.8    3.7       3.7     3.8     3.6
letter    | 4.4       | 3.2   | 4.0    2.2     3.1    | 2.1    2.5       2.1     2.5     2.3
protein   | 50.0      | 40.1  | 42.4   40.1    41.9   | 40.7   38.9      40.7    41.0    40.9
connect4  | 29.5      | 21.1  | 25.8   22.1    24.5   | 20.1   20.1      20.1    22.2    20.4
sensit    | 27.3      | 24.3  | 25.4   24.1    23.7   | 24.0   24.0      24.0    24.4    24.6
rcv20     | 9.1       | N/A   | 8.9    8.6     8.9    | 8.5    8.7       8.4     8.4     8.6
poker     | 38.0      | N/A   | 39.2   36.1    37.8   | 35.0   33.8      35.0    34.3    34.4



Table 3: Running time (seconds) for different SGD methods, online learning algorithms and batch learning
approach for DML. Note that LMNN, a batch DML algorithm, is mainly implemented in C, while the other
algorithms in comparison are implemented in Matlab, which is usually less efficient than C.

          | Batch     | Online Learning               | Proposed Methods
Dataset   | LMNN      | LEGO      OASIS     SPML      | SGD       Mini-SGD  AS-SGD    HR-SGD    HA-SGD
----------+-----------+-------------------------------+-----------------------------------------------
semeion   | 112.7     | 355.8     29.1      206.6     | 2,172.4   263.2     45.2      7.4       42.4
dna       | 255.9     | 330.2     39.1      122.1     | 1,165.3   121.0     30.6      7.1       28.0
isolet    | 2,454.3   | 3,454.2   515.7     3,017.2   | 32,762.7  3,440.7   908.4     127.6     246.3
tdt30     | 264.5     | 372.6     51.2      145.1     | 1,351.0   148.0     108.8     11.6      41.6
letter    | 251.6     | 15.0      10.8      5.6       | 27.3      5.3       10.9      1.8       3.2
protein   | 3,906.4   | 1,318.9   3,825.9   573.8     | 5,448.9   580.6     1,335.8   184.5     145.6
connect4  | 540.2     | 23.1      79.0      16.4      | 109.6     15.9      60.5      8.0       6.97
sensit    | 10,481.2  | 93.3      303.9     44.3      | 365.4     41.3      243.9     26.2      17.9
rcv20     | N/A       | 443.6     1,313.7   154.4     | 1,542.1   158.4     932.9     101.4     45.8
poker     | N/A       | 17.3      17.6      5.8       | 21.0      4.5       13.5      2.8       3.4



4.4 Experiment (III): Comparison with State-of-the-art Online DML Methods

We compare the proposed SGD algorithms to three state-of-the-art
online algorithms and one batch method for DML:

• SPML [18]: an online learning algorithm for DML
that is based on mini-batch SGD and the hinge loss,

• OASIS [?]: a state-of-the-art online DML algorithm,

• LEGO [16]: an online version of the information-theoretic
DML algorithm [9].

Finally, as a sanity check, we also compare the proposed
SGD algorithms to LMNN [19], a state-of-the-art batch
learning algorithm for DML.

Both SPML and OASIS use the same set of triplet constraints to
learn a distance metric as the proposed SGD algorithms. Unlike
SPML and OASIS, however, LEGO uses pairwise constraints for DML.
For a fair comparison, we generate the pairwise constraints for
LEGO by splitting each triplet constraint $(x_i^t, x_j^t, x_k^t)$
into two pairwise constraints: a must-link constraint
$(x_i^t, x_j^t)$ and a cannot-link constraint $(x_i^t, x_k^t)$.
This splitting operation results in a total of 200,000 pairwise
constraints for LEGO. Finally, we note that since LMNN is a batch
learning method, it is allowed to utilize any triplet constraint
derived from the data, and is not restricted to the set of triplet
constraints we generate for the SGD methods. All the baseline DML
algorithms are implemented using the code from the original
authors, except for SPML, for which we made appropriate changes to
the original code in order to avoid large matrix multiplications
and improve the computational efficiency. SPML, OASIS and LEGO are
implemented in Matlab, while the core parts of LMNN are implemented
in C, which is usually deemed to be more efficient than Matlab.
The default parameters suggested by the original authors are used
in the baseline algorithms. The step size of LEGO is set to 1, as
it was observed in [7] that the prediction performance of LEGO is
in general insensitive to the step size. In all experiments, all
the baseline methods set the initial solution for the distance
metric to be an identity matrix.
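
The splitting of triplet constraints into pairwise constraints described above can be sketched as follows. This is an illustrative Python sketch only: triplets are represented as index tuples (i, j, k), where j is assumed to index the similar example and k the dissimilar one, and the function name `split_triplets` is hypothetical rather than part of the LEGO code.

```python
def split_triplets(triplets):
    """Split each triplet (i, j, k) into a must-link pair (i, j) and a cannot-link pair (i, k)."""
    must_link = []
    cannot_link = []
    for i, j, k in triplets:
        must_link.append((i, j))      # x_i and x_j should be close
        cannot_link.append((i, k))    # x_i and x_k should be far apart
    return must_link, cannot_link


# Toy example with three triplets given as index tuples.
triplets = [(0, 1, 2), (3, 4, 5), (0, 4, 2)]
must_link, cannot_link = split_triplets(triplets)
print(len(must_link) + len(cannot_link))
```

Each triplet yields exactly two pairs, so 100,000 triplets give the 200,000 pairwise constraints mentioned above.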

Table 2 summarizes the classification results of k-NN
(k = 3) using the distance metrics learned by the four
baseline algorithms. First, we observe that LEGO performs
significantly worse than the proposed DML algorithms on
five datasets: semeion, isolet, tdt30, connect4, and poker.
This can be explained by the fact that LEGO uses pairwise
constraints for DML, while the other methods in comparison
use triplet constraints. According to [7, 18, 19], triplet
constraints are in general more effective than pairwise
constraints. Second, although both SPML and Mini-SGD are
based on the mini-batch strategy, SPML performs significantly
worse than Mini-SGD on three datasets, i.e., protein, connect4,
and poker. The performance difference between SPML and Mini-SGD
can be explained by the fact that Mini-SGD uses a smooth loss
function, while SPML uses a hinge loss. According to our analysis
and the analysis in [3], using a smooth loss function is critical
for the success of the mini-batch strategy. Third, OASIS yields



Table 4: The number of updates for different SGD methods and online learning algorithms for DML.

          | Online Learning                  | Proposed Methods
Dataset   | LEGO       OASIS      SPML       | SGD        Mini-SGD   AS-SGD     HR-SGD     HA-SGD
----------+----------------------------------+---------------------------------------------------
semeion   | 71,142.4   432.7      10,000     | 100,000    10,000     142.2      101.4      162.8
dna       | 140,027    2,042      10,000     | 100,000    10,000     707        351        372
isolet    | 110,175    1,426      10,000     | 100,000    10,000     1,893      353        378
tdt30     | 131,997.6  2,284.6    10,000     | 100,000    10,000     5,563.7    567.6      784.6
letter    | 130,794    28,063     10,000     | 100,000    10,000     12,931     1,398      457
protein   | 166,384    64,804     10,000     | 100,000    10,000     22,127     3,064      1,623
connect4  | 153,311.6  69,865     10,000     | 100,000    10,000     44,510.8   4,161.2    2,134.3
sensit    | 162,869    78,223     10,000     | 100,000    10,000     60,028     5,675      1,281
rcv20     | 137,246    88,476     10,000     | 100,000    10,000     60,708     6,095      779
poker     | 179,714    71,620     10,000     | 100,000    10,000     43,259     4,111      1,635



performance similar to that of the proposed algorithms on almost
all datasets, except for semeion, dna and poker, on which OASIS
performs significantly worse. Overall, we conclude that the
proposed DML algorithms yield performance comparable to, if not
better than, that of the state-of-the-art online learning
algorithms for DML.
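
To make the distinction drawn above between the hinge loss used by SPML and a smooth loss more concrete, the following Python sketch compares how the derivatives of the two losses behave around a margin of 1. The logistic loss is used here only as one example of a smooth loss and is an assumption of this sketch; the function names are hypothetical.

```python
import math


def hinge_grad(z):
    """A subgradient of the hinge loss max(0, 1 - z)."""
    return -1.0 if z < 1.0 else 0.0


def logistic_grad(z):
    """Derivative of the logistic loss log(1 + exp(-z))."""
    return -1.0 / (1.0 + math.exp(z))


def jump(grad, z, eps=1e-6):
    """Change in the derivative across a window of width 2 * eps around z."""
    return abs(grad(z + eps) - grad(z - eps))


# The hinge derivative jumps by 1 at z = 1, while the logistic derivative changes smoothly.
print(jump(hinge_grad, 1.0), jump(logistic_grad, 1.0))
```

The abrupt change in the hinge derivative is what makes it a non-smooth loss, whereas the derivative of the logistic loss varies continuously with the margin.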

Compared to LMNN, a state-of-the-art batch learning algorithm
for DML, the proposed SGD algorithms yield similar performance
on three datasets. They perform significantly better than LMNN
on the semeion and letter datasets, however, and significantly
worse on the dna, isolet and tdt30 datasets. We attribute the
difference in classification error to the fact that the proposed
DML algorithms are restricted to 100,000 randomly sampled triplet
constraints, while LMNN is allowed to use all the triplet
constraints that can be derived from the data. This restriction
can sometimes limit the classification performance, but at other
times it helps avoid overfitting. We also observe that LMNN is
unable to run on the two large datasets rcv20 and poker,
indicating that LMNN does not scale well with dataset size.

The running time and the number of updates of the baseline
online DML algorithms can be found in Table 3 and Table 4,
respectively. It is not surprising to observe that the three
online DML algorithms are significantly more efficient than SGD
in terms of both running time and the number of updates. We also
observe that Mini-SGD and SPML share the same number of updates
and similar running times on all datasets, because they use the
same mini-batch strategy. Furthermore, compared to the three
online DML algorithms, the two hybrid approaches are significantly
more efficient in both running time and the number of updates.
Finally, since LMNN is implemented in C, it is not surprising
that its running time is similar to that of the online DML
algorithms on relatively small datasets. It is, however,
significantly less efficient than the online learning algorithms
on datasets of modest size (e.g., connect4 and sensit), and
becomes computationally infeasible for the two large datasets
rcv20 and poker. Overall, we observe that the two hybrid
approaches are significantly more efficient than the other DML
algorithms in comparison.

5. CONCLUSION 

In this paper, we propose two strategies to improve the
computational efficiency of SGD for DML, i.e., mini-batch
and adaptive sampling. The key idea of mini-batch is to group
multiple triplet constraints into a mini-batch and to update
the distance metric only once for each mini-batch; the key idea
of adaptive sampling is to perform stochastic updating by giving
a difficult triplet constraint a higher chance of being used to
update the distance metric than an easy triplet constraint. We
develop theoretical guarantees for both strategies. We also
develop two variants of hybrid approaches that combine mini-batch
with adaptive sampling for more efficient DML. Our empirical study
confirms that the proposed algorithms yield prediction performance
comparable to, if not better than, that of the state-of-the-art
online learning algorithms for DML, but with significantly less
running time. Since our empirical study is currently limited to
datasets with a relatively small number of features, we plan to
examine the effectiveness of the proposed algorithms for DML with
high-dimensional data.
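
The effect of the two strategies on the number of updates can be sketched in a few lines of Python. This is a simplified illustration, not the algorithms as specified in the paper: it only counts updates, it assumes the logistic loss as the smooth loss, and it assumes that a triplet's margin is positive when the constraint is satisfied. All function names are hypothetical.

```python
import math
import random


def logistic_grad_abs(margin):
    """|l'(z)| for the logistic loss l(z) = log(1 + exp(-z)); always in (0, 1)."""
    if margin >= 0:
        e = math.exp(-margin)          # numerically stable branch for large margins
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(margin))


def minibatch_update_count(num_triplets, batch_size):
    """Mini-batch: one metric update (and one PSD projection) per mini-batch."""
    return math.ceil(num_triplets / batch_size)


def adaptive_sampling_update_count(margins, rng):
    """Adaptive sampling: update on triplet t only if Z_t = 1, with Pr(Z_t = 1) = |l'(margin_t)|."""
    updates = 0
    for margin in margins:
        if rng.random() < logistic_grad_abs(margin):
            updates += 1
    return updates


rng = random.Random(0)
easy = [5.0] * 1000     # well-satisfied triplets: small |l'|, rarely trigger an update
hard = [-5.0] * 1000    # violated triplets: |l'| close to 1, almost always trigger an update
print(minibatch_update_count(100_000, 10))
print(adaptive_sampling_update_count(easy, rng), adaptive_sampling_update_count(hard, rng))
```

With 100,000 triplets and a batch size of 10, the mini-batch count is 10,000, which matches the Mini-SGD column of Table 4; the adaptive-sampling count depends on how many triplets are still difficult for the current metric.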
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APPENDIX 

The analysis for Theorem 1 is in the supplementary document^1,
and we give the proof for Theorem 2 here. Define:

\[
C_N = \sum_{t=1}^{N} |\ell'(M_t)|, \qquad X_t = Z_t - |\ell'(M_t)|,
\]
\[
\Lambda_N = \sum_{1 \le t \le N} X_t, \qquad K = \max_{1 \le t \le N} X_t \le 1,
\]
\[
\sigma_N^2 = \sum_{t=1}^{N} \mathrm{E}\big[(Z_t - |\ell'(M_t)|)^2\big] \le \sum_{t=1}^{N} |\ell'(M_t)| = C_N.
\]
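Using Bernstein's inequality for martingales [4], applied on each of the
dyadic intervals $[2^{i-1}/N, 2^i/N]$, $i = 1, \ldots, m$, that cover the
range $1/N \le C_N \le N$, we have:

\[
\begin{aligned}
&\Pr\big(\Lambda_N > 2\sqrt{C_N \tau} + \sqrt{2} K \tau / 3\big) \\
&\quad = \Pr\big(\Lambda_N > 2\sqrt{C_N \tau} + \sqrt{2} K \tau / 3,\ \sigma_N^2 \le C_N,\ C_N \le N\big) \\
&\quad \le \Pr(C_N < 1/N) + \sum_{i=1}^{m} \Pr\Big(\Lambda_N > \sqrt{2 \cdot \tfrac{2^i}{N} \tau} + \sqrt{2} K \tau / 3,\ \sigma_N^2 \le \tfrac{2^i}{N}\Big) \\
&\quad \le \Pr(C_N < 1/N) + m e^{-\tau},
\end{aligned}
\]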



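where $m = \lceil \log_2(N^2) \rceil$. By setting $m e^{-\tau} = \delta$, with
probability $1 - \delta$, the number of updates can be bounded as:

\[
\sum_{t=1}^{N} Z_t \le C_N + \frac{1}{2} C_N + 2 \ln\frac{m}{\delta} + \frac{\sqrt{2}}{3} K \ln\frac{m}{\delta}
\]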



< ^E^(A^O + ^lnf 



(7) 



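Then, we give the regret bound. Using the standard analysis
for online learning [3], we have:

\[
\begin{aligned}
\ell(M_t) - \ell(M_*) &\le \langle \ell'(M_t) A_t, M_t - M_* \rangle \\
&= \tau_t Z_t \langle A_t, M_t - M_* \rangle + \big(\ell'(M_t) - \tau_t Z_t\big) \langle A_t, M_t - M_* \rangle \\
&= \tau_t Z_t \langle A_t, M_t - M_* \rangle + \tau_t \big(|\ell'(M_t)| - Z_t\big) \langle A_t, M_t - M_* \rangle,
\end{aligned}
\]

where the last step uses $\ell'(M_t) = \tau_t |\ell'(M_t)|$, i.e., $\tau_t$ is the sign of $\ell'(M_t)$.

Taking the sum from $t = 1$ to $N$, we have: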

t=i ' t=i 

JV 

+ Y,'2rt{\l'{Mt)\-Zt)RA 
t=i 

According to (7), with probability $1 - \delta$, the second term
can be bounded as:



< ':L^J2e{Mt) + jvA^ln 



4 ' 5 



(8) 



where 7 > rjLA? . 

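Applying Bernstein's inequality for martingales [4] to the
last term, we have, with probability $1 - \delta$:

\[
\sum_{t=1}^{N} 2 \tau_t \big(|\ell'(M_t)| - Z_t\big) R A \le 4 R A \sqrt{C_N \ln\frac{m}{\delta}} + \frac{2\sqrt{2}}{3} R A \ln\frac{m}{\delta}
\]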



^ iEW + i^mf + i^^lnf (9) 



Combining the bounds in (8) and (9), we have, with
probability $1 - 2\delta$:
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E iiMt) - e{M,) <—{R^ + 32R^ In 

+ — 1 ' 



5„.2, m 



+7 E ^{Mt) + jr,A^ In- + RA In 



which is equal to: 



^1 https://sites.google.com/site/zljzju/Supplymentary.pdf



where 

fl ,^, ITT- 5 .2 , rn ^ ^ , m 

c = max < - -I- loin — , -A [n—,RAln — 

[2 4 

The proof is completed by setting 7 = Sr/LA^ . 



